This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@kwen2501 (Contributor) commented Oct 3, 2024

TiktokenTokenizer (used by llama3) and SentencePieceProcessor (used by llama2) seem to have different requirements on input shape (1D list vs. 2D list).

This PR adds a condition to separate the input preparation for the two.

It also improves some logging.
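The separation described in this PR could be sketched roughly as follows. This is a minimal sketch, not torchchat's actual code: the stub tokenizer classes below stand in for the real `TiktokenTokenizer` and `SentencePieceProcessor`, whose decode signatures the PR description summarizes as 1D-list vs. 2D-list.

```python
import torch

class TiktokenTokenizer:
    """Stub standing in for torchchat's Tiktoken wrapper (llama3).
    Its decode accepts only a flat (1D) list of token ids."""
    def decode(self, ids):
        assert ids and isinstance(ids[0], int), "expects a 1D list of ints"
        return " ".join(f"tok{i}" for i in ids)

class SentencePieceProcessor:
    """Stub standing in for sentencepiece (llama2).
    Its decode accepts a batched (2D) list of token ids."""
    def decode(self, ids):
        return [" ".join(f"tok{i}" for i in row) for row in ids]

def decode_responses(tokenizer, res):
    # res: list of (batch, chunk) tensors produced during generation
    res = torch.cat(res, dim=1)   # concatenate chunks -> (batch, total_seq_len)
    res_list = res.tolist()       # nested (2D) Python list
    if isinstance(tokenizer, TiktokenTokenizer):
        # Tiktoken wants a 1D list, so decode each batch row separately.
        return [tokenizer.decode(row) for row in res_list]
    # SentencePiece handles the batched 2D list directly.
    return tokenizer.decode(res_list)
```

Either branch yields one decoded string per batch row; only the shape passed to `decode` differs.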

@kwen2501 kwen2501 requested a review from lessw2020 October 3, 2024 08:02
@pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1257

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e54c8d1 with merge base 34d6831 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 3, 2024
    res = torch.cat(res, dim=1)
    res_list = res.tolist()
    responses = tokenizer.decode(res_list)
    if isinstance(tokenizer, TiktokenTokenizer):
nit - I think this type of check is better done directly in the _build_chat_tokenizer function; then we just have an enum for the tokenizer type and can correctly err out on an unrecognized tokenizer.
The reason for that is twofold:
a - we future-proof ourselves, so that if llama4 ships a new tokenizer we are not going back through the code trying to figure out where all the 'tokenizer' type checks are and updating them.
b - we only check once, at a logical point; all other code then uses the enum, and we have a single upfront failure point to err out if we hit an unrecognized tokenizer.
As this code currently stands, it assumes that if the tokenizer is not tiktoken it must be sentencepiece, which is a brittle assumption long term.
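The enum-based check suggested in this comment could look roughly like the sketch below. The `TokenizerType` enum, the `classify_tokenizer` helper, and the class-name-based detection are all illustrative assumptions, not torchchat's actual implementation.

```python
from enum import Enum

class TokenizerType(Enum):
    TIKTOKEN = "tiktoken"            # used by llama3
    SENTENCEPIECE = "sentencepiece"  # used by llama2

def classify_tokenizer(tokenizer) -> TokenizerType:
    """Classify the tokenizer once, at build time, and err out on
    anything unrecognized -- the single upfront failure point the
    review comment describes."""
    name = type(tokenizer).__name__
    if name == "TiktokenTokenizer":
        return TokenizerType.TIKTOKEN
    if name == "SentencePieceProcessor":
        return TokenizerType.SENTENCEPIECE
    raise ValueError(f"Unrecognized tokenizer: {name}")
```

Downstream code would then branch on the returned enum value rather than repeating `isinstance` checks, so adding a hypothetical llama4 tokenizer means touching only this one function.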

@lessw2020 (Contributor) left a comment

thanks for the update!
I left a comment about making the tokenizer check more robust long term, but this is fine to land.

@lessw2020 lessw2020 merged commit 32241ff into main Oct 3, 2024
52 checks passed
@kwen2501 (Contributor, Author) commented Oct 4, 2024

Thanks @lessw2020. Do you know why TiktokenTokenizer cannot decode a 2D list?
